9 research outputs found

Approximate word matches between two random sequences

Author: Burden Conrad J.
Kantorovitz Miriam R.
Wilson Susan R.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 21/01/2008

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.

Comment: Published in at http://dx.doi.org/10.1214/07-AAP452 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org
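
As a rough illustration of the statistic defined in this abstract, the following minimal Python sketch counts pairs of m-letter words, one from each sequence, that differ in at most k positions. The function name `approx_word_matches` and the brute-force double loop are assumptions made here for clarity; they are not taken from the paper, which analyses the statistic's distribution rather than prescribing an implementation.

```python
def approx_word_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) such that the m-letter words a[i:i+m] and
    b[j:j+m] differ in at most k positions.

    With k = 0 this is the plain count of exact m-letter word matches.
    Hypothetical helper: a direct O(len(a) * len(b) * m) illustration.
    """
    count = 0
    for i in range(len(a) - m + 1):
        word_a = a[i:i + m]
        for j in range(len(b) - m + 1):
            word_b = b[j:j + m]
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


exact = approx_word_matches("ACGT", "ACGA", m=2, k=0)   # exact matches only
approx = approx_word_matches("ACGT", "ACGA", m=2, k=1)  # allow one mismatch
```

Allowing mismatches can only add matching pairs, so the count with k = 1 is never smaller than the count with k = 0.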

arXiv.org e-Print Archive

Crossref

The Australian National University

Recommended from our members

Computing DNA Duplex Instability Profiles Efficiently with a Two-State Model: Trends of Promoters and Binding Sites

Author: Gelev Vladimir
Kantorovitz Miriam R
Rapti Zoi
Usheva Anny
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/04/2011

Background: DNA instability profiles have been used recently for predicting the transcriptional start site and the location of core promoters, and to gain insight into promoter action. It was also shown that the use of these profiles can significantly improve the performance of motif finding programs. Results: In this work we introduce a new method for computing DNA instability profiles. The model that we use is a modified Ising-type model and it is implemented via statistical mechanics. Our linear time algorithm computes the profile of a 10,000 base-pair long sequence in less than one second. The method we use also allows the computation of the probability that several consecutive bases are unpaired simultaneously. This is a feature that is not available in other linear-time algorithms. We use the model to compare the thermodynamic trends of promoter sequences of several genomes. In addition, we report results that associate the location of local extrema in the instability profiles with the presence of core promoter elements at these locations and with the location of the transcription start sites (TSS). We also analyzed the instability scores of binding sites of several human core promoter elements. We show that the instability scores of functional binding sites of a given core promoter element are significantly different from the scores of sites with the same motif occurring outside the functional range (relative to the TSS). Conclusions: The time efficiency of the algorithm and its genome-wide applications make this work of broad interest to scientists interested in transcriptional regulation, motif discovery, and comparative genomics.

Harvard University - DASH

Springer - Publisher Connector

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Author: A Barbour
A Christoffels
CJ Burden
Conrad J Burden
J Burke
JE Carpenter
L Florea
M Kimura
Miriam R Kantorovitz
MR Kantorovitz
MS Waterman
OM Melko
RA Lippert
S Vinga
SF Altschul
Sylvain Forêt
TJ Wu
W Hide
WJ Conover
WJ Kent
WR Pearson
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its run time is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.
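
The linear run time mentioned in this abstract can be illustrated with a short Python sketch that tabulates the k-words of each sequence once and then sums the products of their counts. The function name `d2_exact` and the use of `collections.Counter` are assumptions made here for illustration; the abstract does not specify an implementation.

```python
from collections import Counter


def d2_exact(a: str, b: str, k: int) -> int:
    """Count exact k-word matches between sequences a and b.

    Each sequence is scanned once to tabulate its k-words; the result is
    the sum over shared words of (occurrences in a) * (occurrences in b).
    Hypothetical helper, not taken from the paper.
    """
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[j:j + k] for j in range(len(b) - k + 1))
    # Counter returns 0 for words absent from b, so unshared words add nothing
    return sum(n * words_b[w] for w, n in words_a.items())


shared = d2_exact("ACGT", "ACGA", k=2)
```

For a fixed word size, the work grows with the total length of the two sequences rather than with their product, which is the contrast with alignment-based comparison drawn in the abstract.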

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

Computing DNA duplex instability profiles efficiently with a two-state model: trends of promoters and binding sites

Author: A Krueger
A Usheva
Anny Usheva
BS Alexandrov
C Bi
CH Choi
CJ Benham
D Jost
D Poland
DB Nikolov
E Protozanova
E Tøstesen
F Liu
G Kalosakas
H Wakaguri
HB Houbaviy
J SantaLucia
K Brick
M Peyrard
Miriam R Kantorovitz
P Yakovchuk
R Gordan
R Rohs
T Abeel
T Ambjörnsson
T Dauxois
T Hwa
T van Erp
Vladimir Gelev
X Wang
Z Rapti
Zoi Rapti
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Average number of facets per cell in tree-structured vector quantizer partitions

Author: Kenneth Zeger
Miriam R. Kantorovitz
Publication date: 01/01/1993

Abstract: Upper and lower bounds are derived for the average number of facets per cell in the encoder partition of binary tree-structured vector quantizers. The achievability of the bounds is described as well. It is shown in particular that the average number of facets per cell for unbalanced trees must lie asymptotically between 3 and 4 in $\mathbb{R}^2$, and each of these bounds can be achieved, whereas for higher dimensions it is shown that an arbitrarily large percentage of the cells can each have a linear number (in codebook size) of facets. Analogous results are also indicated for balanced trees. Index Terms: Tree-structured vector quantization, data compression, computational geometry.

CiteSeerX

Motif-Blind, Genome-Wide Discovery of cis-Regulatory Modules in Drosophila and Mouse

Author: Göttgens Berthold
Halfon Marc S.
Kantorovitz Miriam R.
Kazemian Majid
Kinston Sarah
Miranda-Saavedra Diego
Robinson Gene E.
Sinha Saurabh
Zhu Qiyun
Publication venue: Elsevier Inc.
Publication date: 01/10/2009

Summary: We present new approaches to cis-regulatory module (CRM) discovery in the common scenario where relevant transcription factors and/or motifs are unknown. Beginning with a small list of CRMs mediating a common gene expression pattern, we search genome-wide for CRMs with similar functionality, using new statistical scores and without requiring known motifs or accurate motif discovery. We cross-validate our predictions on 31 regulatory networks in Drosophila and through correlations with gene expression data. Five predicted modules tested using an in vivo reporter gene assay all show tissue-specific regulatory activity. We also demonstrate our methods' ability to predict mammalian tissue-specific enhancers. Finally, we predict human CRMs that regulate early blood and cardiovascular development. In vivo transgenic mouse analysis of two predicted CRMs demonstrates that both have appropriate enhancer activity. Overall, 7/7 predictions were validated successfully in vivo, demonstrating the effectiveness of our approach for insect and mammalian genomes.

Elsevier - Publisher Connector

PubMed Central